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Abstract 

We consider the problem of analyzing the heterogeneity of clustering distributions for mul- 
tiple groups of observed data, each of which is indexed by a covariate value, and inferring 
global clusters arising from observations aggregated over the covariate domain. We propose 
^ | a novel Bayesian nonparametric method reposing on the formalism of spatial modeling and 

a nested hierarchy of Dirichlet processes. We provide an analysis of the model properties, 
relating and contrasting the notions of local and global clusters. We also provide an efficient 
inference algorithm, and demonstrate the utility of our method in several data examples, in- 
cluding the problem of object tracking and a global clustering analysis of functional data where 
the functional identity information is not available. Q 
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1 Introduction 

In many applications it is common to separate observed data into groups (populations) indexed 
by some covariate u. A particularly fruitful characterization of grouped data is the use of mixture 
distributions to describe the populations in terms of clusters of similar behaviors. Viewing ob- 
servations associated with a group as local data, and the clusters associated with a group as local 
clusters, it is often of interest to assess how the local heterogeneity is described by the changing 
values of covariate u. Moreover, in some applications the primary interest is to extract some sort 
of global clustering patterns that arise out of the aggregated observations. 

Consider, for instance, a problem of tracking multiple objects moving in a geographical area. 
Using covariate u to index the time point, at a given time point u we are provided with a snapshot 
of the locations of the objects, which tend to be grouped into local clusters. Over time, the objects 
may switch their local clusters. We are not really interested in the movement of each individual 
object. It is the paths over which the local clusters evolve that are our primary interest. Such paths 
are the global clusters. Note that the number of global and local clusters are unknown, and are to 
be inferred directly from the locally observed groups of data. 

The problem of estimating global clustering patterns out of locally observed groups of data 
also arises in the context of functional data analysis where the functional identity information 
is not available. By the absence of functional identity information, we mean the data are not 
actually given as a collection of sampled functional curves (even if such functional curves exist 
in reality or conceptually), due to confidentiality constraints or the impracticality of matching the 
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identity of individual functional curves. As another example, the progesterone hormone behaviors 
recorded by a number of women on a given day in their monthly menstrual cycle is associated 
with a local group, which are clustered into typical behaviors. Such local clusters and the number 
of clusters may evolve throughout the monthly cycle. Moreover, aggregating the data over days 
in the cycle, there might exist one or more typical monthly ("global" trend) hormone behaviors 
due to contraception or medical treatments. These are the global clusters. Due to privacy concern, 
the subject identity of the hormone levels are neither known nor matched across the time points 
u. In other words, the data are given not as a collection of hormone curves, but as a collection of 
hormone levels observed over time. 

In the foregoing examples, the covariate u indexes the time. In other applications, the covariate 
might index geographical locations where the observations are collected. More generally, obser- 
vations associated with different groups may also be of different data types. For instance, consider 
the assets of a number of individuals (or countries), where the observed data can be subdivided 
into holdings according to different currency types (e.g., USD, gold, bonds). Here, each u is asso- 
ciated with a currency type, and a global cluster may be taken to represent a typical portforlio of 
currency holdings by a given individual. In view of a substantial existing body of work drawing 
from the spatial statistics literature that we shall describe in the sequel, throughout this paper a co- 
variate value u is sometimes referred to as a spatial location unless specified otherwise. Therefore, 
the dependence on varying covariate values u of the local heterogeneity of data is also sometimes 
referred to as the spatial dependence among groups of data collected at varying local sites. 

We propose in this paper a model-based approach to learning global clusters from locally dis- 
tributed data. Because the number of both global and local clusters are assumed to be unknown, and 
because the local clusters may vary with the covariate u, a natural approach to handling this uncer- 
tainty is based on Dirichlet process mixtures and their variants. A Dirichlet process DP(a , G ) de- 
fines a distribution on (random) probability measures, where a is called the concentrati on param- 
eter, and parameter Go denotes the base probability measure or centering distribution ([Ferguson , 



19731) . A random draw G from the Dirichlet process (DP) is a di screte measure (wit h probability 



1), which admits the well-known "stick-breaking" representation (|SethuramanL 1 1 994|) : 

oo 
k=l 

where the are independent random variables distributed according to G , 5^ denotes an 
atomic distribution concentrated at (f)/., and the stick breaking weights ix^ are random and depend 
only on parameter a - Due to the discrete nature of the DP realizations, Dirichlet processes and 
their variants have become an effective tool in mixture modeling and learning of clustered data. 
The basic idea is to use the DP as a prior on the mixture components in a mixture model, where 
each mixture component is associated with an atom in G. The posterior distribution of the atoms 
provides the probability distribution on mixture components, and also yields a probability distri- 
bution of partitions of the data. The resultan t mixture model, generally known as the Dirichlet 
process mixture, w as pioneered by the work of Antoniak (1974) and subsequentially d eveloped by 
many others (e.g., (|LoUl984l : Escobar and Westll 19951 : LMacEachern and Muelleru 19981) ). 

A Dirichlet process (DP) mixture can be utilized to model each group of observations, so 
a key issue is how to model and assess the local heterogeneity among a collection of DP mix- 
tures. In fact, there is an extensive literature in Bayesian nonparametrics that focuses on coupling 
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multiple Dirichlet p r ocess mixture distributions (e.g., MacEachernj (|1999|) : iMueller et al.l d2004); 



Delorio et al.l (120041) : llshwaran and Jamesl (12001b : iTeh et al.l (I2006|) ). A common theme has been 



to utilize the Bayesian hierarchical modeling framework, where the parameters are conditionally 
independent draws from a probability distribution. In particular, suppose that the u-indexed group 
is modeled us ing a mixing dis tribution G u . We highlight the hierarchical Dirichlet process (HDP) 
introduced by Teh et al.l ( 2006h . a framework that we shall subsequentially generalize, which posits 
that G u \a , G ~ DP(ct , G ) for some base measure G and concentration parameter a . More- 
over, G is also random, and is distributed according to another DP: G \^f, H ~ DP(7, H). The 
HDP model and other aforementioned work are inadequate for our problem, because we are in- 
terested in modeling the linkage among the groups not through the exchangeability assumption 
among the groups, but through the more explicit dependence on changing values of a covariate u. 

Coupling mul tiple DP-distributed mixture distributions can be described under a general frame- 
work outlined by iMacEachernl (|1999|) . In this framework, a DP-distributed random measure can 
be represented by the random "stick" and "atom" random variables (see Eq. (OQ)), which are gen- 
eral stochastic processes indexed by u £ V. Starting fro m this representat i on, there are a num- 



ber of proposals for co-varying infinit e mixture models (Duan etaU 120071 : iPetrone et al. . 2009: 



Rodriguez et all 1201 0t iDunsonl 120081 : iNguyen and Gelfandl |2010|) . These proposals were de- 



signed for functional data only, i.e., where the data are given as a collection of sampled functions 
of u, and thus not suitable for our problem, becau se functional identity information is assumed un - 
known in our setting. In this r egard, the work of lGriffin and Steell(|2006l) : lDunson and Park (2008); 



Rodriguez and Dunsonl (120091) are somewhat closer to our setting. These authors introduced spa- 



tial depende ncy of the local DP mixtures t hrough the stick variables in a number of interesting 
ways, while iRodriguez and Dunson (|2009|) additionally considered spatially varying atom vari- 
ables, resulting in a flexible model. These work focused mostly on the problem of interpolation 
and prediction, not clustering. In particular, they did not consider the problem of inferring global 
clusters from locally observed data groups, which is our primary goal. 

To draw inferences about global clustering patterns from locally grouped data, in this paper we 
will introduce an explicit notion of and model for global clusters, through which the dependence 
among locally distributed groups of data can be described. This allows us to not only assess the de- 
pendence of local clusters associated with multiple groups of data indexed by u, but also to extract 
the global clusters that arise from the aggregated observations. From the outset, we use a spatial 
stochastic process, and more generally a graphical model H indexed over u £ V to characterize 
the centering distribution of global clusters. Spatial stochastic p rocess and graphical models are 
versatile and customary choice for modeling of multivariate data (|Cressie . 1993: LauritzenL 11996: 



JordanL |2004|) . To "link" global clusters to local clusters, we appeal to a hierarchical and non- 



parametric Bayesian formalism: The distribution Q of global clusters is random and distributed 
according to a DP: Q\H ~ DP(7,if). For each u, the distribution G u of local clusters is as- 

sumed random, and is distributed according to a DP: G U \Q ~ DP(a u , Q u ), where Q u denotes 
the marginal distribution at u induced by the stochastic process Q. In other words, in the first stage, 
the Dirichlet process Q provides support for global atoms, which in turn provide support for the 
local atoms of lower dimensions for multiple groups in the second stage. Due to the use of hier- 
archy and the discreteness property of the DP realizations, there is sharing of global atoms across 
the groups. Because different groups may share only disjoint components of the global atoms, the 
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spatial dependency among the groups is induced by the spatial distribution of the global atoms. We 
shall refer to the described hierarchical specification as the nested Hierarchical Dirichlet process 
(nHDP) model. 

The idea of incorporating spatial d ependence in the base measure of Dirichle t proce sses goes 
back to lCifarelli and Regazzinil(ll978b : lMuliere and Petronddl993b : lGelfand et alJd2005h . although 
not in a fully nonparametric hierarchical framework as is considered here. The proposed nHDP 
is an in stantiation of th e non parametric and hierarchical modeling philosophy eloquently advo- 
cated in lTeh and Jordanl (|2010|) . but there is a crucial distinction: Whereas Teh and Jordan gener- 
ally a dvocated for a re cursive construction of Bayesian hierarchy, as exemplified by the popular 
HDP (|Teh et all 120061) . the nHDP features a richer nested hierarchy: instead of taking a joint dis- 
tribution, one can take marginal distributions of a random distribution to be the base measure to 
a DP in the next stage of the hierarchy. This feature is essential to bring about the relationship 
between global clusters and local clusters in our model. In fact, the nHDP generalizes the HDP 
model in the following sense: If H places a prior with probability one on constant functions (i.e., 
if (j) = (0 u ) ue 

v ~ H then <\> u = (f) v \lu, v G V) then the nH DP is reduced to the HDP. 

Most closely related to our work is the hybrid DP of IPetrone et al.1 (|2009|) . which also considers 
global and local clustering, and which in fact serves as an inspiration for this work. Because 
the hybrid DP is designed for functional data, it cannot be applied to situations where functional 
(curve) identity information is not available, i.e., when the data are not given as a collection of 
curves. When such functional id information is indeed available, it makes sense to model the 
behavior of individual curves directly, and this ability may provide an advantage over the nHDP. 
On the other hand, the hybrid DP is a rather complex model, and in our experiment (see Section[5]), 
it tends to overfit the data due to the model complexity. In fact, we show that the nHDP provides a 
more satisfactory clustering performance for the global clusters despite not using any functional id 
information, while the hybrid DP requires not only such information, it also requires the number of 
global clusters ("pure species") to be pre- specified. It is worth noting that in the proposed nHDP, 
by not directly modeling the local cluster switching behavior, our model is significantly simpler 
from both viewpoints of model complexity and computational efficiency of statistical inference. 

The paper outline is as follows. Section [2] provides a brief background of Dirichlet processes, 
the HDP, and we then proceed to define the nHDP mixture model. Section [3] explores the model 
properties, including a stick-breaking characterization, an analysis of the underlying graphical and 
spatial dependency, a Polya-urn sampling characterization. We also offer a discussion of a rather 
interesting issue intrinsic to our problem and the solution, namely, the conditions under which 
global clusters can be identified based on only locally grouped data. As with most nonparametric 
Bayesian methods, inference is an important issue. We demonstrate in Section|4]that the confluence 
of graphical/spatial with hierarchical modeling allows for efficient computations of the relevant 
posterior distributions. Section [5] presents several experimental results, including a comparison to 
a recent approach in the literature. Section [6] concludes the paper. 
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2 Model formalization 



2.1 Background 



We start with a brief backgrou nd on Dirichlet p rocesses (IFergusonl 119731) . and then proceed to 
hierarchical Dirichlet processes dTeh et alll2006h . Let (9 , B, G Q ) be a probability space, and ao > 
0. A Dirichlet process DP(ao, Go) is defined to be the distribution of a random probability measure 
G over (0 O , B) such that, for any finite measurable partition (Ai, . . . , A r ) of Oo, the random vector 
(G(Ai), . . . ,G(A r )) is distributed as a finite dimensional Dirichlet distribution with parameters 
(a G (Ax), . . . , a G (A r )). a is referred to as the concentration parameter, which governs the 
amount of variability of G around the centering distribution G . A DP-distributed probability 
mea sure G is discret e with probability one. Moreover, it has a constructive representation due 



Sethuramanl (119941) : G = J2T=i 7l kS ( f >k , where (4>k}kLi are iid draws from G , and 5 ( p k denotes an 



to 

atomic probability measure concentrated at atom The elements of the sequence 7r = (irk)kLi 
are referred to as "stick-breaking" weights, and can be expressed in terms of independent beta 
variables: itk = ^n*=i(l ~ ^D' wnere (^Dz^i are iid draws from Beta(l,o;o). Note that 7r 
satisfies J2T=i = 1 with probability one, and can be viewed as a random probabity measure 
on th e positive integers. For notational convenience, we write n ~ GEM(a )> following iPittman 
d2002h . 

A useful viewpoint for the Dirichlet process is given by the Polya urn scheme, which shows 
that draws from the Dirichlet process are both discrete and exhibit a clustering property. From 
a computational perspective, the Polya urn scheme provides a method for sampling from the 
random distribution G, by integrating out G. More concretely, let atoms 9\, 9^, ■ ■ • are iid ran- 
dom variables distributed accord ing to G. Because G is random, 9\, 6 2 , . . . are exchangeable. 
Blackwell and MacOueenl (| 1973b showed that the conditional distribution of 0$ given 9\, . . . , 
has the following form: 



. . . , 6i-i, oto, Gc 



i-l 



CXq 



Se t + -. — — — >^o 
i - 1 + a 



Gn. 



This expression shows that 9^ has a positive probability of being equal to one of the previous 
draws 6x, ■ ■ ■ , Moreover, the more often an atom is drawn, the more likely it is to be drawn 
in the future, suggesting a clustering property induced by the random measure G. The induced 
distri bution over random partitions of is also known as the Chinese restaurant process (lAldousL 



1985). 



A Dirichlet process mixture model utilizes G as the prior on the mixture component 0. Com- 
bining with a likelihood function P(y\0) = F(y\0), the DP mixture model is given as: 0i\G ~ G; 



yi\0i ~ F{-\9j). Such mixture models have been st udied in the pioneering work of lAntoniak 



(119741) and subsequentially by a number of authors (|Lolll984l : lEscobar and Westl . ll995uMacEachern and Mueller 
19981), For more recen t and elegant acco unts on the theories and wide-ranging applications of DP 



mixture modeling, see lHjort et al.1 (|2010|) . 



Hierarchical Dirichlet Processes. Next, we proceed giving a brief background on the HDP 



formalism of iTeh et al.l (120061) . which is typically motivated from the setting of grouped data. 
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Under this setting, the observations are organized into groups indexed by a covariate u E V, where 
V is the index set. Let y ul , y u2 , . . . , y uriu be the observations associated with group u. For each 
u, the {y U i}i are assumed to be exchangeable. This suggests the use of mixture modeling: The 
y ui are assumed identically and independently drawn from a mixture distribution. Specifically, 
let Q U i E Q u denote the parameter specifying the mixture component associated with y ui . Under 
the HDP formalism, O u is the same space for all u E V, i.e., Q u = O for all u, and 9 is 
endowed with the Borel a-algebra of subsets of 6 . ui is referred to as local factors indexed by 
covariate u. Let F(-\9 ui ) denote the distribution of observation y ui given the local factor 6 ui . Let 
G u denote a prior distribution for the local factors (#ui)i^i- We assume that the local factors # ui 's 
are conditionally independent given G u . As a result we have the following specification: 

9 ui \G u ~ G u ; y ui \9 ui ~ F(-\6 ui ), for any u E V;i = 1, . . . , n u . (2) 

Under the HDP formalism, to statistically couple the collection of mixing distributions G u , we 
posit that random probability measures G u are conditionally independent, with distributions given 
by a Dirichlet process with base probability measure Go: 

G u \a , G ~ DP(a , G ). 

Moreover, the HDP framework takes a fully nonparametric and hierarchical specification, by posit- 
ing that G is also a random probability measure, which is distributed according to another Dirich- 
let process with concentration parameter 7 and base probability measure H: 

G |7,#~DP( 7 ,#). 

An interesting property of the HDP is that because G u 's are discrete random probability measures 
(with probability one) whose support are given by the support of G . Moreover, G is also a 
discrete measure, thus the collection of G u are random discrete measures sharing the same count- 
able support. In addition, because the random partitions induced by the collection of 6 ui within 
each group u are distributed according to a Chinese restaurant process, the collection of these 
Chinese restaurant processes are statistically coupled. In fact, they are exchangeable, and the 
distributio n for the collec tion of such stoschastic processes is known as the Chinese restaurant 
franchise dTeh et all I2OO6I) . 

2.2 Nested hierarchy of DPs for global clustering analysis 

Setting and notations. In this paper we are interested in the same setting of grouped data as that of 
the HDP that is described by Eq. ©. Specifically, the observations y ul , y u2 , . . . , y unu within each 
group u are iid draws from a mixture distribution. The local factor ui E Q u denotes the parameter 
specifying the mixture component associated with y ui . The (0 U i)i^i are iid draws from the mixing 
distribution G u . 

Implicit in the HDP model is the assumptions that the spaces Q u all coincide, and that random 
distributions G u are exchangeble. Both assumptions will be relaxed. Moreover, our goal here is 
the inference of global clusters, which are associated with global factors that lie in the product 
space := fluey ®«- To this end, 6 is endowed with a cr-algebra B to yield a measurable space 
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(9, B). Within this paper and in the data illustrations, 6 = IR y , and B corresponds to the Borel 
a-algebra of subsets of M. v , Formally, a global factor, which are denoted by tp or cp in the sequel, 
is a high dimensional vector (or function) in whose components are indexed by covariate u. That 
is, t/> = (ipu)uev £ @> an d <fi = (4> u )ue_v G O. As a matter of notations, we always use i to denote 
the numbering index for 9 U (so we have 6 U i). We always use t and k to denote the number index 
for instances of i/>'s and </>'s, respectively (e.g., i\) t and <fi k ). The components of a vector i\) t (4> k ) 
are denoted by i\) ut (4> uk ). We may also use letters v and w beside u to denote the group indices. 



Model description. Our modeling goal is to specify a distribution Q on the global factors ifr, and 
to relate Q to the collection of mixing distributions G u associated with the groups of data. Such 
resultant model shall enable us to infer about the global clusters associated with a global factor -0 
on the basis of data collected locally by the collection of groups indexed by u. At a high level, 
the random probability measures Q and the G u 's are "glued" together under the nonparametric 
and hierarchical framework, while the probabilistic linkage among the groups are governed by a 
stochastic process 4> — {<t>u)u£V indexed by u 6 V and distributed according to H. Customary 
choices of such stochastic processes include either a spatial process, or a graphical model H. 

Specifically, let Q u denote the induced marginal distribution of ip u . Our model posits that for 
each u E V, G u is a random measure distributed as a DP with concentration parameter a u , and base 
probability measure Q u : G u \a u ,Q ~ DP(a u ,Q u ). Conditioning on Q, the distributions G u are 
independent, and G u varies around the centering distribution Q u , with the amount of variability 
given by a u . The probability measure Q is random, and distributed as a DP with concentration 
parameter 7 and base probability measure H: Q\^,H ~ DP(7,i/), where H is taken to be a 
spatial process indexed by u E V, or more generally a graphical model defined on the collection 
of variables indexed by V. In summary, collecting the described specifications gives the nested 
Hierarchical Dirichlet process (nHDP) mixture model: 

Q\-y,H ~ DP( 7 ,tf), 

G u \a u , Q m ~ P DP(a u , Q u ), for all u E V 

U i\G u ~ G u , y U i\0 U i ~ for all u,i, 

As we shall see in the next section, the </> fe 's, which are draws from H, provide the support for 
global factors ijj t ~ Q, which in turn provide the support for the local factors 6 ui ~ G u . The global 
and local factors provide distinct representations for both global clusters and local clusters that we 
envision being present in data. Local factors 9 ui 's provide the support for local cluster centers at 
each u. The global factors xp in turn provide the support for the local clusters, but they also provide 
the support for global cluster centers in the data, when observations are aggregated across different 
groups. 



Relations to the HDP. Both the HDP and nHDP are instances of the n onparametric and hi erar- 
chical modeling framework involving hierarchy of Dirichlet processes (|Teh and Jordanll2010|) . At 
a high-level, the distinction here is that while the HDP is a recursive hierarchy of random prob- 
ability measures generally operating on the same probability space, the nHDP features a nested 
hierarchy, in which the probability spaces associated with different levels in the hierarchy are dis- 
tinct but related in the following way: the probability distribution associated with a particular level, 
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say G u , has support in the support of the marginal distribution of a probability distribution (i.e., 
Q) in the upper level in the hierarchy. Accordingly, for u ^ v, G u and G v have support in distinct 
components of vectors if). For a more explicit comparison, it is simple to see that if H places 
distribution for constant global factors <f> with p robability one (e.g., for any 4> ~ H there holds 
4> u = (p v Wu, v G V), then we obtain the HDP of iTeh et all d2006h . 



3 Model properties 

3.1 Stick-breaking representation and graphical or spatial dependency 

Given that the multivariate base measure Q is distributed as a Dirichlet process, it can be expressed 
using Sethuraman's stick-breaking representation: Q = Y^T=i Pk$4> k ■ Each atom 4> k is multivariate 
and denoted by cf> k = (<fi uk : u G V). The 4> k s axe independent draws from H, and (3 = (f3j i )'j*L 1 ~ 
GEM(7). The (fi k s and (3 are mutually independent. The marginal induced by Q at each location 
u G V is: Q u = Y,kLiPk$<l> uk - Since each Q u has support at the points (<f> u k)kLi> eacn G u 
necessarily has support at these points as well, and can be written as: 

oo oo 

G u = ^7i uk 5 4>uk \ Qu = ^2(3 k S^ uk . (3) 
fc=i fc=i 

Let 7r n = (iiuk)^^. Since G u 's are independent given Q, the weights 7r n 's are independent 
given /3. Moreover, because G u \a u ,Q ~ DP(q, M Q„.) it is possible to derive the relationship 



between weights tz u 's and (3. Following ITeh et al.l (|2006|) . if H is non-atomic, it is necessary 



and sufficient for G u defined by Eq. © to satisfy G u ~ DP(a u Q u ) that the following holds: 
ir u ~ DP(a M , (3), where n u and (3 are interpreted as probability measures on the set of positive 
integers. 



The connection between the nHDP and the HDP of ITeh et al.l (|2006) can be observed clearly 
here: The stick-breaking weights of the nHDP-distributed G u have the same distributions as those 
of the HDP, while the atoms 4> uk are linked by a graphical model distribution, or more generally a 
stochastic process indexed by u. 

The spatial/graphical dependency given by base measure H induces the dependency between 
the DP-distributed G u 's. We shall explore this in details by considering specific examples of H. 
Example 1 (Graphical model H). For concreteness, we consider a graphical model H of three 
variables (f) u , (f) v , <p w which are associated with three locations u,v,w G V. Moreover, assume the 
conditional independence relation: <\> u _L 4> w \4> v - Let i\) = (ijj u , ip v , ip w ) be a random draw from Q. 
Because Q ~ DP(7, H), if) also has distribution H once Q is integrated out. Thus, ?p u _L ip w \if) v . 

At each location u G V, the marginal distribution Q u of variable -ip u is random and Q u \7, H 
~ DP (7, H u ). Moreover, in general the Q u 's are mutually dependent regardless of any (condi- 
tional) independence relations that H might confer. This fact can be easily seen from Eq. ©. With 
probability 1, all Q u 's share the same (3. It follows that Q u _L Q W \Q V , (3. Because (3 is random, 
the conditional independence relation no longer holds among Q u , Q w , Q v in general. From a mod- 
eling standpoint, the dependency among the QuS is natural for our purpose, as Q provides the 
distribution for the global factors associated with the global clusters that we are also interested in 
inferring. 
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Figure 1. Left: The nHDP is depicted as a graphical model, where each unshaded node represents a 
random distribution. Right: A graphical model representation of the nHDP using the stick-breaking 
parameterisation. 



Turning now to distributions G u for local factors 9 ui , we note that G U ,G V , G w are independent 
given Q. Moreover, for each u E V , the support of G u is the same as that of Q u (i.e., 9 ui for 
i = 1,2,... take value among (^ u t)t^i)- Integrating over the random Q, for any measurable 
partition A c 6 U , there holds: E[G U (A)\H] = E[E[G U (A)\Q]\H] = E[Q U (A)\H] = H U {A). In 
sum, the global factors ?/>'s take values in the set of (0 fc )^ 1 ~ H, and provide the support set for 
the local factors 9 ui 's at each u E V. The prior means of the local factors 9 U iS are also derived 
from the prior mean of the global factors. 

Example 2 (Spatial model H). To quantify more detailed dependency among DP-distributed 
G u 's, let V be a finite subset of W and H be a second-order stochastic process indexed by v E V. 
A customary choice for if is a Gaussian process. In effect, </> = (<p u : u E V) ~ A r (/x, £), where 
the covariance S has entries of the exponential form: p(u, v) = a 2 exp — {co\\u — v\\ }. 

For any measurable partitions A C ©„, and B C Q v , we are interested in expressions for 
variation and correlation measures under Q and G u 's. Let H UV (A, B) := p(4> u E A,<fi v E B\H). 
Define #(7) = 1/(7 + 1). Applying stick-breaking representation for Q u , it is simple to derive 
that: 

Proposition 1. For any pair of distinct locations u, v), there holds: 

Cov(Q u (A),Q v (B)\H) = g( 1 )(H uv (A,B)-H u (A)H v (B)), (4) 
Vm(Q u (A)\H) = g(^i)(H u (A) — H U (A) 2 ), (5) 

Cov(Q u (A),Q v (B)\H) 



Con(Q u (A),Q v (B)) 



Vsx(Q u (A)\Hy/ 2 Vea(Q v (B)\Hy/ 2 
(H UV (A,B)-H U (A)H V (B)) 
(H U (A) - H u (A) 2 y/ 2 (H v (B) - H v (B) 2 y/ 2 - 



(6) 



For any pair of locations u,v E V, if \\u — v\\ — > 00, it follows that p(w, v) = Cov(0 u , 4> V \H) 
— > 0. Due to standard properties of Gaussian variables, 4> u and <f) v become less dependent of each 
other, and H UV (A, B) - H U (A)H V (B) 0, so that Corr(Q u (A), Q V {B)) 0. On the other hand, 
if u — v — > 0, we obtain that Cott(Q u (A) : Q v (A)) — > 1, as desired. 

Turning to distributions G u 's for the local factors, the following result can be shown: 
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Proposition 2. For any pair ofu,v G V, there holds: 



Vzr(G u (A)\H) 



E [Var (G u ( A) | Q) \ H] + Var (E [G u ( A) \ Q) \ H) 
{g{i) + g(a u ) - <7( 7 )<?K))(# U (A) - H U (A) 2 ), 
g(j)Carr(Q u {A),Q v {B)\H) 



(7) 



Cott(G u (A),G v (B)) 



(gh) + g(au) - gh)g(au)y/ 2 (g(i) + g(a v ) - g(-f)g(a v )y/ 2 ■ 



where g(a u ) = l/(a u + 1). 

Eq. © exhibits an interesting decomposition of variance. Note that V&r(G u (A)\H) > V&r(Q u (A)\H). 
That is, the variation of a local factor is greater than that of the global factor evaluated at the same 
location, where the extra variation is governed by concentration parameter a u . If a u — > oo so that 
g(a u ) — > 0, the local variation at u disappears, with the remaining variation contributed by the 
global factors only. If a u — > so that g(a u ) — > 1, the local variation contributed by G u completely 
dominates the global variation contributed by Q u . 

Finally, turning to correlation measures in the two stages in our hierachical model, we note that 
Corr (G U (A), G V (B)\H) < Cott(Q u (A), Q v (B)\H). That is, the correlation across the locations 
in V among the distributions G„'s of the local factors is bounded from above by the correlation 
among the distribution Q u 's for the global factors. Note that Corr (G u ( A), G V (B)) vanishes as 
1 1 it — v\\ — > oo. The correlation measure increases as either a u or a v increases. The dependence on 
7 is quite interesting. As 7 ranges from to 00 so that (7(7) decreases from 1 to 0, and as a result 
the correlation measure ratio Corr(G u (A), G V (B)) /Corr (Q U (A), Q V (B)) decreases from 1 to 0. 

3.2 Polya-urn characterization 

The Polya-urn characterization of the canonical Dirichlet process is fully retained by the nHDP It 
is also useful in highlighting both local clustering and global clustering aspects that are described 
by the nHDP mixture. In the sequel, the Polya-urn characterization is given as a sampling scheme 
for both the global and local factors. Recall that the global factors (f) 1 , </> 2 , . . . are i.i.d. random 
variables distributed according to H. We also introduced random vectors 1/^ which are i.i.d. draws 
from Q. Both <p k and tp t are multivariate, denoted by <fr k = {4> u k) u &v and ip t = (ip u t)uev- Finally, 
for each location u G V, the local factor variables 9 ui are distributed according to G u . 

Note that each ip t is associated with one 4> k , and each 9 ui is associated with one ip ut . Let t ui be 
the index of the r tp ut associated with the local factor 9 ui , and k t be the index of the <ft k associated 
with the global factor x^> t . Let K be the present number of distinct global factors (f) k . The sampling 
process starts with K = and increases K as needed. We also need notations for counts. We 
use notation n ut to denote the present number of local factors 6 u i taking value ip ut . n u denotes the 
number of local factors at group u (which is also the number of observations at group u). n u . k is 
the number of local factors at u taking value <\> uk . Let m u denote the number of factors i\) t that 
provide supports for group u. The notation q k denotes the number of global factors i/> t 's taking 
value cj) k , while q. denotes the total number of global factors i/^'s- To be precise: 





t 



10 



First, consider the conditional distribution for 6 ui given 9 U \, 6 u2} . . . , O u ,i-i, and Q, where the 
G u is integrated out: 

<9 ~ > , -t^ — <W + — — (8) 

z - 1 + a« 2 - 1 + a u 

This is a mixture, and a realization from this mixture can be obtained by drawing from the terms on 
the right-hand side with probabilities given by the corresponding mixing proportions. If a term in 
the first summation is chosen, then we set 9 ui = ip ut for the chosen t, and let t ui = t, and increment 
n ut . If the second term is chosen, then we increment m u by one, draw ip urriu ~ Q u . In addition, 
we set 9 ui = Vw u , and t ui = m u . 

Now we proceed to integrate out Q. Since Q appears only in its role as the distribution of the 
variable ij) t , we only need to draw sample %j) t from Q. The samples from Q can be obtained via 
the conditional distribution of ij) t as follows: 

K 

^m^H-J^-^-S^ + -L-H. (9) 

If we draw xj) t via choosing a term in the summation on the right-hand side of this equation, we set 
ip t = (f) k , and let k t = k for the chosen k, and increment q k . If the second term is chosen then we 
increment K by one, draw <f> K ~ H and set ip t = <p K , kj t = K, and qx = 1. 

The Polya-urn characterization of the nHDP can be illustrated by the following culinary metaphor. 
Suppose that there are three groups of dishes (e.g., appetizer, main course and dessert) indexed by 
u, v and w. View a global factor ^ fc 's as a typical meal box where each 0^, 0^ and 4> w k is asso- 
ciated with a dish group. In an electic eatery, the dishes are sold in meal boxes, while customers 
come in, buy dishes and share among one another according to the following process. A new cus- 
tomer can join either one of the three groups of dishes. Upon joining the group, she orders a dish 
to contribute to the group, i.e., a local factor 9 ui , based on its popularity within the group. She can 
also choose to order a new dish, but to do so, she needs to order the entire meal box, i.e. a global 
factor ip t . A meal box is chosen based on its popularity as a whole, across all eating groups. 

The "sharing" of global factors (meal box) across indices u can be seen by noting that the 

"pool" of present global factors {ifii} has support in the discrete set of global factor values l5 (f) 2 , 

Moreover, the spatial (graphical) distribution of the global factors induces the spatial dependence 
among local factors associated with each group indexed by u. See Fig.[2]for an illustration. 



3.3 Model identifiability and complexity 

This section investigates the nHDP mixture's inferential behavior, including issues related to the 
model identifiability. It is use ful to recall that a DP mixture model can b e viewed as the infinite 
limit of finite mixture models (|Neallll992l : llshwaran and Zarepouru2002b|) . The nHDP can also be 
viewed as the limit of a finite mixture counterpart. Indeed, consider the following finite mixture 
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Figure 2. Illustration of the assignments of mixture component membership via global and local 
factor variables for two groups indexed by u and v. 



model: 



/3|7 ~ Dir(7/L, . . .7/L) Tz u \a u , (3 ~ Dir(a u /3) (f> k ~ H 

L L 



(10) 



fe=l 



fe=l 



It is a known fact that as L — > 0, Q L =>- Q weakly, in the sense that for any real-valued bounded 
and c ontinuous function g, there holds J g dQ L J g dQ in distribution (|Muliere and Secchil . 
Il995h . § Because for each u e V, there holds G L U ~ DP(a u Q L ), it also follows that => G u 
weakly. The above characterization provides a convenient means of understanding the behavior of 
the nHDP mixture by studying the behavior of its finite mixture counterpart with L global mixture 
components, as L — > 00. 

Information denseness of nHDP prior. For concreteness in this section we shall assume that 
for any u E V the likelihood F(y u \(f> u ) is specified by the normal distribution whose parameters 
such as mean and variance are represented by 4> u . Write 4> u = (/x u , a%) 6 (Rx Recall that 
conditionally on Q, G u 's are independent across u E V. Given G u , the marginal distribution on 
observation y u has the following density: 



f u {y u \G u ) 



F(y u \cf> u )dG u ( 



(11) 



Thus, each f u is the density of a location-scale mixture of normal distribution. The / u 's are random 
due to the randomness of GVs. In other words, the nHDP places a prior distribution, which we 
denote by n, over the collection of random measures (G u ) u( zy This in turn induces a prior over 
the joint density of y := (y u ) ue v, which we call II as well. Replacing the mixing distributions Q 
and G u by the finite mixture Q L and G^'s (as specified by Eq. (fTOl)). we obtain the corresponding 



2 A stronger result was obtained by llshwaran and Zarepoud d2002bl) . Theorem 2, in which convergence holds for 
any integrable function g with respect to H. 
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marginal density: 

fu{Vu\Gu) = / F(y u \<f> u )dGt(<!>u). (12) 



Let U L to denote the induced prior distribution for {fu}uev- From the above, n L =>• II weakly. 

We shall show that for each u E V the prior U L is information dense in the space of finite 
mixtures as L — > oo. Indeed, for any group index u, consider any finite mixture of normals f U)0 
associated with mixing distributions Q and G u of the form: 

d d 
Qo — /J fffc.o'Wo) ^,o = / ] n u k,o8<t> ukt0 , (13) 

k=l k=l 

Proposition 3. Suppose that the base measure H places positive probability on a rectangle con- 
taining the support of Qo, then the prior II ^ places a positive probability in arbitrarily small 
Kullback-Leibler neighborhood of f u .ofar L sufficiently large. That is, for any e > 0, there holds: 
U L {fu ■ D(fu,o\\fu) < e) > for any sufficiently large L. 

At a high level, this result implies that the nHDP provides a prior over the space of mixture 
distributions that is "well spread" in the Kull back-Leibler topology. A proof of this result can be 



obtained using the same proof techniques of llshwaran and Zarepourl (|2002a|) for a similar result 



applied to (non-hierarchical) finite-dimensional Dirichlet distributions, and is therefore omitted. 
An immediate consequence of the information denseness property is the we ak consistency o f the 



posterior distribution of y u for any u G V, thanks to the asymptotic theory of lSchwarta (|1965). 
Identifiability of factors 4>. The above results are relevant from the viewpoint of density estima- 
tion (for the joint vector y). From a clustering viewpoint, we are also interested in the ability of 
the nHDP prior in recovering the underlying local factors (fiu^s, as well as the global factors fc 's 
for the global clusters. This is done by studying the identifiability of the finite mixtures that lie in 
the union of the support of Hl for all L < oo. This is the set of all densities )«eV;L<oo whose 
corresponding mixing distributions are given by Eq. (fTOT ). 

Recall that each marginal is a normal mixture, and the L mixture components are param- 
eterised by U & = (fi u k, cr^ k ) for k = 1, . . . , L. Again, let / Uj o be the "true" marginal density 
of a mixture distribution for group u that has d mixture components, and the associated mixing 
distributions Q and G Uj0 are given by Eq. (fT3b . The parameter for the A;-th component for each 
k — 1, . . . , d is denoted by 4>uk.,n =Jjhik,0i a lk o)- The following is a direct consequence of Theo- 



rem 2 of llshwaran and Zarepoud (|2002af) : 



Proposition 4. Suppose that for any u G V, f u {Vv) — fufiiVu) for almost all y u . In addition, the 
mixing distributions satisfy the following condition: 



[ expf J 1 W(rf</g<oo, 



for any u G V, where cr* = min{cr uli0 , . . . , cr U k,o}- Then, G u = G u forall u G V. 

In other words, this result claims that it is possible to identify all local clusters specified by 
4> u k and ix uk for k = 1, . . . , d, up to the ordering of the mixture component index k. A more 
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substantial issue is the identifiability of global factors. Under additional conditions of "true" global 
factors <p k 's, and the distribution of global factors Q L , the identification of global factors <fi k 's is 
possible. Viewing a global factor <fi k = (4> u k) u &v (likewise, 4> k ) as a function of u E v, a trivial 
example is that when <fi k are constant functions, and that base measure H (and consequentially 
Q L ) places probability 1 on such set of functions, then the identifiability of local factors implies 
the identifiability of global factors. A nontrivial condition is that the "true" global factors <fi k as 
a function of u can be parameterised by a small number of parameters (e.g. a linear function, or 
an appropriately defined smooth function in u E V). Then, it is possible that the identifiability of 
local factors also implies the identifiability of global factors. An in-depth theoretical treatment of 
this important issue is beyond the scope of the present paper. 

The above observations suggest several prudent guidelines for prior specifications (via the base 
measure H). To ensure good inferential behavior for the local factors 0„'s, it is essential that the 
base measure H u places sufficiently small tail probabilities on both fx u and a u . In addition, if it is 
believed the underlying global factors are smooth function in the domain V, placing a very vague 
prior H over the global factors (such as a factorial distribution H = Y[ u &v ^ u by assuming the cf) u 
are independent across u E V) may not do the job. Instead, an appropriate base measure H that 
puts most of its mass on smooth functions is needed. Indeed, these observations are also confirmed 
by our empirical experiments in Section 5. 



4 Inference 

In this section we shall describe posterior inference methods for the nested Hierarchical Dirichlet 
process mixture. We describe two different sampling approaches: The "marginal approach" pro- 
ceeds by integrating out the DP-distributed random measures, while the "conditional approach" 
exploits the stick-breaking representation. The former approach arises directly from the Polya-urn 
characterization of the nHDP. However its implementation is more involved due to book-keeping 
of the indices. Within this section we shall describe the conditional approach, leaving the details 
of the marginal approach to the supplemental material. Both sampling methods draw fro m the ba- 
sic fea tures of the sampling methods developed for the Hierarchical Dirichlet Process of Teh et al.1 



(|2006l) . in addition to the computational issues that arise when high-dimensional global factors are 



sampled. 

For the reader's convenience, we recall key notations and introduce a few more for the sampling 
algorithms. t ui is the index of the ip u t associated with the local factor 9 ui , i.e., 6 ui = ip u t ui ', and k t 
is the index of the 0^ associated with the global factor ij) t , i.e., ip t = <fi kt . The local and global 
atoms are related by 9 ui = ip u t ui = 4> u k t ■ Let z ui = k tui denote the mixture component associated 
with observation y ui . Turning to count variables, denotes the number of local atoms u j's 
that are associated with ip t , excluding atom 9 ui . denotes the number of local atoms 9 u i that 
such that z u i = k, leaving out 6 ui . t~ ul denotes the vector of all t u i's leaving out element t U i. 
Likewise, k 1 denotes the vector of all k r 's leaving out element k t . In the sequel, the concentration 
parameters 7, a u , and parameters for H are assumed fixed. In practice, we also place sta n dard prior 



distrib utions on these param eters, following the approaches of lEscobar and Westi(|1995|) : lTeh et al 
d2006l) for 7, a u , and, e.g., iGelfand et al] d2005h for H's. 



The main idea of the conditional sampling approach is to exploit the stick-breaking represen- 
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tation of DP-distributed Q instead of integrating it out. Likewise, we also consider not integrating 
over the base measure H. Recall that a priori Q ~ DP(7, H). Due to a standard property of the 
posterior of a Dirichlet process, conditioning on the global factors fc 's and the index vector k, Q 

is distributed as DP (7 + q., 7H+ ^ | T I? 1 qkSe ^ k ) . Note that vector q can be computed directly from k. 

Thus, an explicit representation of Q is Q = Y*=i Pk$4> h + /3new<5 new , where Q new ~ DP(7, H), 
and 

(3 = (/3i,...,/3^,/3 new ) ~Dir(gi,...,g fc ,7). 

Conditioning on Q, or equivalently conditioning on (3, <p k s in the stick breaking representa- 
tion, the distributions G u 's associated with different locations u E V are decoupled (indepen- 
dent). In particular, the posterior of G u given Q and k, t and the fc 's is distributed as DP(a u + 

n u , ""^" + ^^i ra "- fc<5 ^fc y Thus, an explicit representation of the conditional distribution of G u is 

given as G» = £f =1 ^fc<W + tt^G^, where G" ew ~ DP(« u /3 new , Q^O and 

) ~ Dir(« u /?i + n u .i, . . . , a u (3 k + n u .^, a u /3 n ew)- 

In contrast to the marginal approach, we consider sampling directly in the mixture component 
variable z U i = k tui , and in doing so we bypass the sampling steps involving k and t. Note that 
the likelihood of the data involves only the z ui variables and the global atoms <p k 's. The mixture 
proportion vector (3 involves only count vectors q = . . . , q K ). It suffices to construct a Markov 
chain on the space of (z, q, (3, </>). 

Sampling (3. As mentioned above, (3\q ~ Dir(gi, . . . , g#, 7). 

Sampling z. Recall that a priori z ui ]7r u , (3 ~ 7r u where 7r w |/3, a u ~ DP(a n , /3). Let denote 
the number of data items in the group u, except y ui , associated with the mixture component k. This 
can be readily computed from the vector z. 

P (z ui = k\z^ q, (3, <t> k , Data) = ( ^ + "ff^^ if k used (14 ) 

where 

Note that if z ui is taken to be k new , then we update K = K + 1. (Obviously, k new takes the value of 
the updated K). 

Sampling q. To clarify the distribution for vector q, we recall an observation at the end of Sec- 
tion !3.2l that the set of global factors t/> t 's can be organized into disjoint subsets *& u , each of which 
is associated with a location u. More precisely, ip t E if and only if n ut > 0. Within each group 
u, let m uk denote the number of i/> t 's taking value 4> k . Then, q k = J2 u &v m uk- 

Conditioning on z we can collect all data items in group u that are associated with mixture 
component <p k , i.e., item indices ui such that z ui = k. There are n u . k such items, which are 
distributed according to a Dirichlet process with concentration parameter a u /3 k . The count variable 
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m uk corresponds to the number of mixture components formed by the n u . k items. It was shown by 
Antoniak (1974) that the distribution of m uk has the form: 



•pirn 



uk 



m\z, rn 



-uk 



/3) 



-s(n u . k ,m)(a u /3 k ) 



where s(n,m) are unsigned Stirling number of the first kind. By definition, s(0, 0) = s(l, 1) = 
1, s(n, 0) = for n > 0, and s(n, m) = for m > n. For other entries, there holds s(n + 1, m) = 

s(n, m — 1) + ns(n, m). 

Sampling cf). The sampling of . . . , (f> k follows from the following conditional probabilities: 
p(tf> k \z, Data) oc H((f> h ) JJ F(y ui \(j) uk ) for each k = 1, . . . , K. 

Let us index the set V by 1, 2, ... , M, where |V| = M. We return to our two examples. 

As the first example, suppose that <p k is normally distributed, i.e., under H, 4> k ~ N(/J, k , H k ), 
and that the likelihood F(y ui \9 U i) is given as well by N{0 ui , of), then the posterior distribution of 
4> k is also Gaussian with mean fi k and variance Yl k , where: 



S* 1 



(J 



r diag(ni. fe , 



o; 



} u yid{zu = k) . . . } u VMiKzMi = k) 



(16) 



For the second example, we assume that <p k is very high dimensional, and the prior distribution H 
is not tractable (e.g., a Markov random field). Direct computation is no longer possible. A simple 
solution is to Gibbs sample each component of vector <p k . Suppose that under a Markov random 
field model H, the conditional probability H(<p uk \ 4> k u ) is simple to compute. Then, for any u E V, 

p((f) uk \(f) k u ,z,DsitSi) cc H((f) uk \(f) k u ) JJ F(y ui \(f) uk ). 



Computation of conditional density of data A major computational bottleneck in sampling 
methods for the nHDP is the computation of conditional densities given by Eq. (TT5T ) and (TT8T ). In 
general, (f> is very high dimensional, and integrating over cf) ~ H is intractable. However it is pos- 
sible to exploit the structure of H to alleviate this situation. As an example, if H is conjugate to F, 
the computation of these conditionals can be achieved in closed form. Alternatively, if H is speci- 
fied as a graphical model where conditional independence assumptions can be exploited, efficient 
inference methods in graphical models can be brought to bear on our computational problem. 
Example 1. Suppose that the likelihood function F is given by a Gaussian distribution, i.e., 
y U i\9ui ~ N(6 U i, of) for ai l u, i, and that the prior H is conjugate, i.e., H is also a Gaussian distri- 
bution: <f> k ~ N(fi k , Sfc). Due to conjugacy, the computations in Eq. (fT8l) are readily available in 
closed forms. Specifically, the density in Eq. (TT8T ) takes the following expression: 

fuk Ul (yui) = ^2n) 1 / 2 a \c k \ 6XP ( ~ 2of ^ + 2^ k + ^ k +^ k + ~ 2^"' ^ k ' 
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Spatially varying mixture distributions Spatially varying mixture distributions 




Locations u Locations u 



Figure 3. Left: Data set A illustrates a simulated problem of tracking particles organized into 
clusters, which move in smooth paths. Right: Data set B illustrates bifurcating trajectories. In both 
cases, data are given not as trajectories, but only as individual points denoted by circles at each u. 



where 



C- k l = S fc 1 + ^diag(n-]f, . . . , 1 + n~l\ . . . , n^J), 



1 



m = u'i') 



diag(n-r,...,n-r 



> • • • ) n M-k)i 



(17) 



It is straightforward to obtain required expressions for f k Vt {y t ), f~ k y ^i{y U i), and f k J?J(y t ) - the 
latter two quantities are given in the Appendix. 

Example 2. If if is a chain-structured model, the conditional densities defined by Eq. (fl~8l ) are 
not available in closed forms, but we can still obtain exact computation usin g an algorithm that is 
akin to the well-known alpha-beta algorithm in the Hidden Markov model (IRabinerl . Il989l) . The 
running time of such algorithm is proportional to the size of the graph (i.e., |V|). For general 
graphical models, one can apply a sum -product algorithm or approximate variational inference 
methods ( Wainwright and Jordan . 2008b . 



5 Illustrations 

Simulation studies. We generate two data sets of spatially varying clustered populations (see 
Fig. [3] for illustrations). In both data sets, we set V = {1, . . . , 15}. For data set A, K = 5 
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Posterior distributions of global cluster centers 
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Figure 4. Data set A. Left: Posterior distribution of the number of global clusters. Right: Poste- 
rior distributions of the global atoms. Dashed lines denote the mean curve and (.05, .95) credible 
intervals. 



Figure 5. Data set B. Left: Posterior distribution of the number of global clusters (atoms). Right: 
Posterior distributions of the global atoms. Dashed lines denote the mean curve and the (.05, .95) 
credible intervals. 

global factors 0i, . . . , 4> 5 are generated from a Gaussian process (GP). These global factors pro- 
vide support for 15 spatially varying mixtures of normal distributions, each of which has 5 mixture 
components. The likelihood F(6 ui ) is given by N(9 U i, erf), cr e = 0.1. For each u we generated 
independently 100 samples from the corresponding mixture (20 samples from each mixture com- 
ponents). Note that each circle in the figures denote a data sample. This kind of data can be 
encountered in tracking problems, where the samples associating with each covariate u can be 
viewed as a snapshot of the locations of moving particles at time point u. The particles move in 
clusters. They may switch clusters at any time, but the identification of each particle is not known 
as they move from one time step to the next. The clusters themselves move in relatively smoother 
paths. Moreover, the number of clusters is not known. It is of interest to estimate the cluster 
centers, as well as their moving paths. For data set B, to illustrate the variation in the number 
of local clusters at different locations, we generate a number of global factors that simulate the 

3 Particle-specific tracking is possible if the identity of the specific particle is maintained across snapshots. 



Posterior distributions of global cluster centers 
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Posterior distributions of global cluster centers 

3i ■ . 





Figure 6. Effects of vague prior for H results in weak identifiability of global clusters, even as the 
local clusters are identified reasonably well. 



Num ol local clusters at loc=1 Num of local clusters at loc=3 Num of local clusters at loc=5 Num ol local clusters at loc=8 
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Num of local clusters at loc=10 Num ol local clusters al loc=1 1 Num of local clusters al loc=13 





Figure 7. Data set B : Posterior distribution of the number of local clusters associating with different 
group index (location) u. 



bifurcation behavior in a collection of longitudinal trajectories. Here a trajectory corresponds to a 
global factor. Specifically, we set V = {1, . . . , 15}. Starting at u = 1 there is one global factor, 
which is a random draw from a relatively smooth GP with mean function fj,(u) = P^u, where 
~ Unif(— 0.2, 0.2) and the exponential covariance function parameterised by a = 1, uj = 0.05. 
At u = 5, the global factor splits into two, with the second one also an independent draw from the 
same GP, which is re-centered so that its value at u = 4 is the same as the value of the previous 
global factor at u = 4. At u = 10, the second global factor splits once more in the same manner. 
These three global factors provide support for the local clusters at each u E V. The likelihood 
F(-\9 U i) is given by a normal distribution with a e = 0.2. At each u we generated 30 independent 
observations. 

Although it is possible to perform clustering analysis for data at each location u, it is not clear 
how to link these clusters across the locations, especially given that the number of clusters might be 
different for different it's. The nHDP mixture model provides a natural solution to this problem. It 
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Progresterone hormone curves 




Figure 8: Progeresterone hormone curves. 
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Fi gure 9. Clusterin g results using the nHDP mixture model (Left), and the hybrid-DP 
of Petrone et al. (2009) (Right). Mean and credible intervals of global clusters (in dashed lines) 
are compared to sample mean curves of the contraceptive group and no contraceptive group in 
black solid with square markers. 



is fit for both data sets using essentially the same prior specifications. The concentration parameters 
are given by 7 ~ Gamma(5, .1) and a ~ Gamma(20, 20). H is taken to be a mean-0 GP using 
(a, u) = (1, 0.01) for data set A, and (1, 0.05) for data set B. The variance cr^ is endowed with 
prior InvGamma(5, 1). The results of posterior inference (via MCMC sampling) for both data sets 
are illustrated by Fig. 0] and Fig. [5] With both data sets, the number global clusters are estimated 
almost exactly (5 and 3, respectively, with probability > 90%). The evolution of the posterior 
distributions on the number of local clusters for data set B is given in Fig. U\ In both data sets, 
the local factors are accurately estimated (see Figs. |4] and [5]). For data set B, due to the varying 
number of local clusters, there are regions for u, specifically the interval [5, 10] where multiple 
global factors alternate the role of supporting local clusters, resulting in wider credible bands. 

In Section [3] we discussed the implications of prior specifications of the base measure H for 
the identifiability of global factors. We have performed a sensitivity analysis for data set A, and 
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Figure 10. Pairwise comparison of individual hormone curves. Each entry in the heatmap depicts 
the posterior probability that the two curves share the same local clusters, averaged over a fixed 
interval ([1,20] in the left, and [21,24] in the right figure) in the menstrual cycle. 
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Figure 11. The leftmost panel shows the posterior distribution of the number of global clusters, 
while remaining panels show the the number of local clusters associating with group index it. 



found that the inference for global factors is robust when oj is set to be in [.01, .1]. For uj = 0.5, 
for instance, which implies that <p u are weakly dependent across it's, we are not able to identify the 
desired global factors (see Fig. [6]), despite the fact that local factors are still estimated reasonably 
well. 

The effects of prior specification for a e on the inference of global factors are somewhat similar 
to the hybrid DP model: a smaller a e encourages higher num bers of and less smooth glob al curves 
to expand the coverage of the function space (see Sec. 7.3 of [Nguyen and Gelfandl (|2010l) ). Within 
our context, the prior for a t is relatively more robust than that of to as discussed above. The prior 
for concentration parameter 7 is extremely robust while the priors for a u 's are somewhat less. We 
believe the reason for this robustness is due to the modeling of the global factors in the second 
stage of the nested hierarchy of DPs, and the inference about these factors has the effect of pooling 
data from across the groups in the first stage. In practice, we take all a u 's to be equal to increase 
the robustness of the associated prior. 



Progesterone hormone clustering. We turn to a clustering analysis of Progesterone hormone 
data. This data set records the natural logarithm of the progesterone metabolite, measured by 
urinary hormone assay, during a monthly cycle for 51 female subjects. Each cycle ranges from -8 
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Figur e 12. Pairwise comparison of individual hormone curves using the hybrid-DP dPetrone et al. , 
2009). Each entry in the heatmap depicts the posterior probability that the two curves share the 
same local clusters, averaged over a fixed interval ([1,20] in the left, and [21,24] in the right figure) 
in the menstrual cycle. 



to 15 (8 days pre-ovulation to 15 days post-ovulation). We are interested in clustering the hormone 
levels per day, and assessing the evolution over time. We are also interested in global clusters, 
i.e., identifying global hormone pattern for the entire monthly cycle and ana lyzing the effects on 
contr aception on the clustering patterns. See Fig. [8] for the illustration and iBrumback and Rice 
(|l998|) for more details on the data set. 

For prior specifications, we set 7 ~ Gamma(5, 0.1), and a u = 1 for all u. Let a t ~ InvGamma(2, 
For H, we set fx = 0, a = 1 and to = 0.05. It is found that the there are 2 global clusters with 
probability close to 1. In addition, the mean estimate of global clusters match very well with the 
sample means from the two groups of women, a group of those using contraceptives and a group 
that do not (see Fig. [9]). Examining the variations of local clusters, there is a significant probability 
of having only one local cluster during the first 20 days. Between day 21 and 24 the number of 
local clusters is 2 with probability close to 1. 

To elaborate the effects of contraception on the hormone behavior (the last 17 female subjects 
are known to use contraception), a pairwise comparison analysis is performed. For every two 
hormone curves, we estimate the posterior probability that they share the same local cluster on a 
given day, which is then averaged over days in a given interval. It is found that the hormone levels 
among these women are almost indistinguishable in the first 20 days (with the clustering-sharing 
probabilities in the range of 75%), but in the last 4 days, they are sharply separated into two distinct 
regimes (with the clustering- sharing probability between the two groups are droppe d to 30%). 
W e compare our approach to th e hybrid Dirichlet process (hybrid-DP) approach (Petrone et al. . 



2009; Nguyen and Gelfandl |2010|) . perhaps the only existing approach in the literature for joint 



modeling of global and local clusters. The data are given to the hybrid-DP as the replicates of 
a random functional curve, whereas in our approach, such functional identity information is not 
used. In other words, for us only a collection of hormone levels across different time points are 
given (i.e., the subject ID of hormone levels are neither revealed nor matched with one another 
across time points). For a sensible comparison, the same prior specification for base measure H of 
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the global clusters were used for both approaches. The inference results are illustrated in Fig. |9j 
A close look reveals that the global clusters obtained by the hybrid-DP approach is less faithful 
to the contraceptive/no contraceptive grouping than ours. This can be explained by the fact that 
hybrid-DP is a more complex model that directly specifies the local cluster switching behavior for 
functional curves. It is observed in this example that an individual hormone curve tends to over- 
switch the local cluster assignments for u > 20, resulting in significantly less contrasts between 
the two group of women (see Fig. [10] and [121). This is probably due the com plexity of the hybrid- 
DP, w hich can only be overcome with more data (see Propositions 7 and 8 of Nguyen and Gelfand 
(|2010l) for a theoretical analysis of this model's complexity and posterior consistency). Finally, it 
is also worth noting that the hybrid-DP approach pr actically requires the number of clusters to be 
specified a priori (as in the so-called /c-hybrid-DP in Petrone et al.1 (|2009|) ). while such information 
is directly infered from data using the nHDP mixture. 



6 Discussions 

We have described a nonparametric approach to the inference of global clusters from locally dis- 
tributed data. We proposed a nonparametric Bayesian solution to this problem, by introducing the 
nested Hierarchical Dirichlet process mixture model. This model has the virtue of simultaneous 
modeling of both local clusters and global clusters present in the data. The global clusters are 
supported by a Dirichlet process, using a stochastic process as its base measure (centering distri- 
bution). The local clusters are supported by the global clusters. Moreover, the local clusters are 
randomly selected using another hierarchy of Dirichlet processes. As a result, we obtain a col- 
lection of local clusters which are spatially varying, whose spatial dependency is regulated by an 
underlying spatial or a graphical model. The canonical aspects of the nHDP (because of its use 
of the Dirichlet processes) suggest straightforward extensions to accomodate richer behaviors us- 
ing Poisson-Dirichlet processes (also known as the Pittman-Yor processes), where they have been 
found to be particularly suitable for certain applications, and where our analysis and inference 
methods can be easily adapted. It would also be interesting to consider a multivariate version of 
the nHDP model. Finally, the manner in which global and local clusters are combined in the nHDP 
mixture model is suggestive of ways of direct and simultaneous global and local clustering for 
various structured data types. 



7 Appendix 

7.1 Marginal approach to sampling 

The Polya-urn characterization suggests a Gibbs sampling algorithm to obtain posterior distribu- 
tions of the local factors 6> M j's and the global factors if) t % by integrating out random measures Q 
and G u 's. Rather than dealing with the 6> ui 's and t/> t directly, we shall sample index variables t ui 
and k t instead, because 9 U i's and t/> t 's can be reconstructed from the index variables and the & 's. 
This representation is generally thought to make the MCMC sampling more efficient. Thus, we 
construct a Markov chain on the space of {t, k}. Although the number of variables is in principle 
unbounded, only finitely many are actually associated to data and represented explicitly. 
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A quantity that plays an important role in the computation of conditional probabilities in this 
approach is the conditional density of a selected collection of data items, given the remaining data. 
For a single observation i-th at location u, define the conditional probability of y ui under a mixture 
component 4> uk , given t, k and all data items except y u f. 

yul( _ J F(yui\(puk)UuH^ur,z ulx ,=k F (yu'^\<Pu'k)H(^ k )d^ k 

Similary, for a collection of observations of all data y ui such that t U i = t for a chosen t, which we 
denote by vector y t , let f k Vt (y t ) be the conditional probability of y t under the mixture component 
4> k , given t, k and all data items except y t . 

Sampling t. Exploiting the exchangeability of the t^s within the group of observations indexed by 
u, we treat t ui as the last variable being sampled in the group. To obtain the conditional posterior 
for t ui , we combine the conditional prior distribution for t ui with the likelihood of generating 
data y ui . Specifically, the prior probability that t ui takes on a particular previously used value t is 
proportional to n~™ ', while the probability that it takes on a new value t new = m u + 1 is proportional 
to a u . The likelihood due to y ui given t ui = t for some previously used t is f~ k ui (y U i) ■ Here, 
k = k t . The likelihood for t ui = t new is calculated by integrating out the possible values of k t mw: 

K 

p(y ui \t~ ui , t m = t new , k, Data) = -^-fuk Vui (yu t ) + -^—fu^iVud, d9) 

fc=i + 7 9- + 7 

where f~ k y ^{y U i) = J F(y ui \<f) u )H u (<f) u )d<f) u is the prior density of y ui . As a result, the conditional 
distribution of t ui takes the form 



i 1 h Ui (Vui) if t previously used 

P (t ul = t\t~ u \ k, Data) cx { _ * ^\^[ _ „ . c ± +new (20) 



a uP (y m \t- u %t ui = t new ,fc) if t = it 
If the sampled value of t ui is t new , we need to obtain a sample of k t n™ by sampling from Eq. (Tl9l) : 



(lf uk lZ(yui) ifk = k new . 

Sampling k. As with the local factors within each group, the global factors i/> t 's are also exchange- 
able. Thus we can treat i\) t for a chosen t as the last variable sampled in the collection of global 
factors. Note that changing index variable k t actually changes the mixture component membership 
for relevant data items (across all groups u) that are associated with i\) t , the likelihood obtained by 
setting k t = k is given by f k Vt (y t ), where y t denotes the vector of all data y ui such that t ui = t. 
So, the conditional probability for k t is: 



Q.kfk Vt i.Vt) if A; previously used 
where f k ^{y t ) = J Ilui:t ul =t F (yui\(i>u)H((f>)d<j) 



p(kt = fc|t,Jfe _t ,Data) oc ^ Kat ' r J ' (22) 

\&(y t ) Hk = k™, 
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Sampling of 7 and a . We fo llow the method of auxiliary variables developed by Escobar and West 



dl995b and iTeh et all d2006h . Endow 7 with a Gamma(a 7 , b 7 ) prior. At each sampling step, we 



draw r] ~ Beta(7 + l,q.). Then the posterior of 7 is can be obtained as a gamma mixture, which 
can be expressed as 7r 7 Gamma(a 7 + K, 6 7 — log(r/)) + (1 — 7r 7 )Gamma(a 7 + K — 1, 6 7 — log(ry)), 
where 7r 7 = (a 7 + A — l)/(a 7 + A' — 1 + g.(6 7 — log(r/))). The procedure is the same for each 
a u , with n u and m u playing the role of q. and K, respectively. Alte rnatively, one can force all a u 



to be equal and endow it with a gamma prior, as in lTeh et al.l(|200q) 
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